QTM 447 Lecture 26: Diffusion Models

Kevin McAlister

April 17, 2025

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Generative Models

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

Success:

  • Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!

  • Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!

  • Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?

Generative Models

So far, we’ve talked about three types of generative models.

Autoregressive Models

\[ P(\mathbf x) = \prod \limits_{t = 1}^T P(x_t | x_1,x_2,...,x_{t-1}) \]

Advantages:

  • Directly compute and maximize \(P(\mathbf x)\)

  • Generates high quality images due to pixel by pixel generation strategy

Disadvantages:

  • Very slow to train

  • Very slow to generate high res images

  • No explicit latent code

Generative Models

Variational Autoencoders

\[ P(\mathbf x) \ge E_{Q}[\log P(\mathbf x | \mathbf z)] - D_{KL}(Q(\mathbf z | \mathbf x) || P(\mathbf z)) \]

\[ P(\mathbf x) = \int P(\mathbf x | \mathbf z) P(\mathbf z) dz \]

Advantages:

  • Fast image generation

  • Very rich latent codes

Disadvantages:

  • Maximizing a lower bound on the likelihood, which is not necessarily close to the truth

  • Generated images often blurry due to averaging behavior

Generative Models

GANs

\[ \underset{\theta}{\text{min }} \underset{\phi}{\text{max }} \frac{1}{2} E_{P(\mathbf x)}[\log D_{\phi} (\mathbf x)] + E_{Q(\mathbf z)}[\log(1 - D_{\phi}(g_{\theta}(\mathbf z)))] \]

  • \(g_{\theta}(\mathbf z)\) is a generator network that creates fake images

  • Train a discriminator to separate real and fake images

  • Maximize the power of the discriminator while also minimizing the JS divergence between the real and fake images

Generative Models

GANs

Advantages:

  • Fast image generation

  • Performs arbitrarily well due to lack of distributional assumptions

  • Good performance in image editing

Disadvantages:

  • Minimal control over generation

  • Training is a nightmare

Generative Models

Today, we’re going to quickly cover the new hotness - diffusion models

  • And quickly introduce the stable variant

Diffusion models are like a cross between VAEs and autoregressive models

  • Kinda sorta

Diffusion Models

Conceptually, diffusion models are quite simple.

  • Start with an initial image, \(\mathbf x\)

  • Sequentially add normal random noise to the image pixels

  • Repeat this process until the resulting image is a collection of random pixels

  • Learn a reverse mapping that decodes the random image back to its initial state

Seems simple…

Diffusion Models

Why this would work with a collection of images is a bit of a thinker.

Think about the decoder model:

  • Learn the best mapping from “random” back to the original image

  • Over a collection of images, think about the first move - try to recover a low level commonality among all images

  • Conditional on the first move, learn another low level commonality

  • Eventually, recover the original image up to arbitrary precision with enough decoder conditioning!

Diffusion Models

Two parts:

  • A prespecified (but stochastic) encoder that maps images to random space

  • A learnable decoder that inverts the encoder

The decoder is the hard part!

  • But, the structure of the encoder is going to dictate how the decoder works

Let’s attack these in order.

Diffusion Models

Diffusion Encoder

For notational simplicity, let \(\mathbf x = \mathbf z_0\)

Define a sequence of latent variables over \(T\) periods

\[ \{\mathbf z_0, \mathbf z_1, ..., \mathbf z_T\} \]

Diffusion Models

The forward process:

\[ \mathbf z_{t} = \sqrt{1 - \beta_t} \mathbf z_{t-1} + \sqrt{\beta_t} \epsilon_t \]

  • \(\beta_t\) is a mixing value between 0 and 1

  • \(\epsilon_t\) is a noise draw from a standard normal distribution (most frequently all pixels are assumed independent)

The first term attenuates the input (e.g. original signal) and the second term blends in noise!

  • \(\beta_t\) determines how quickly the noise is blended
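The forward step above can be sketched in a few lines of numpy. This is a minimal sketch with illustrative names (`forward_step` is not from any library):

```python
# A minimal numpy sketch of one forward diffusion step (illustrative names).
import numpy as np

def forward_step(z_prev, beta_t, rng):
    """z_t = sqrt(1 - beta_t) * z_{t-1} + sqrt(beta_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 8))   # a tiny stand-in "image"
z1 = forward_step(z0, beta_t=0.02, rng=rng)
```

With a small \(\beta_t\), each step attenuates the signal only slightly and blends in only a little noise.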

Diffusion Models

More useful specification:

\[ P(\mathbf z_t | \mathbf z_{t-1}) = \mathcal N(\mathbf z_t | \sqrt{1 - \beta_t} \mathbf z_{t-1} , \beta_t \mathcal I) \]

  • A sequence of conditional distributions relating each latent variable with the last

  • The variance of each RV is a function of a fixed set of mixing values

Thus, we can define a joint distribution over all latent variables given an input \(\mathbf z_0\) as:

\[ P(\mathbf z_1,\mathbf z_2,...,\mathbf z_T | \mathbf z_0) = \prod \limits_{t = 1}^T P(\mathbf z_t | \mathbf z_{t-1}) \]

  • A Markovian process

  • Our conditional chain reduces to one step behind

Diffusion Models

We could use this direct sequential process to encode images, but this would be slow at scale

  • \(\mathbf z_0\) to \(\mathbf z_1\) to \(\mathbf z_2\) and on and on and on

Fortunately, we can express the distribution of \(\mathbf z_t\) given \(\mathbf z_0\) at any time \(t\) in closed form, without explicitly marginalizing! Start with the first two steps:

\[ \mathbf z_2 = \sqrt{1 - \beta_2} \mathbf z_1 + \sqrt{\beta_2} \epsilon_2 \]

\[ \mathbf z_1 = \sqrt{1 - \beta_1} \mathbf z_0 + \sqrt{\beta_1} \epsilon_1 \]

Omitting the algebra:

\[ \mathbf z_2 = \sqrt{(1 - \beta_2)(1 - \beta_1)} \mathbf z_0 + \sqrt{1 - \beta_2 - (1 - \beta_2)(1 - \beta_1)} \epsilon_1 + \sqrt{\beta_2} \epsilon_2 \]

Diffusion Models

We can think of the last two terms as draws from independent normal distributions:

\[ \sqrt{1 - \beta_2 - (1 - \beta_2)(1 - \beta_1)} \epsilon_1 + \sqrt{\beta_2} \epsilon_2 \sim \mathcal N(0 , 1 - \beta_2 - (1 - \beta_2)(1 - \beta_1)) + \mathcal N(0, \beta_2) \]

Since a sum of independent normals is normal with summed variances, the two terms collapse into a single draw \(\epsilon \sim \mathcal N(\mathbf 0, \mathcal I)\) scaled as:

\[ \sqrt{1 - (1 - \beta_2)(1 - \beta_1)} \epsilon \]

meaning that our update equation becomes:

\[ \mathbf z_2 = \sqrt{(1 - \beta_2)(1 - \beta_1)} \mathbf z_0 + \sqrt{1 - (1 - \beta_2)(1 - \beta_1)} \epsilon \]
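The variance bookkeeping in the algebra above can be checked numerically for any pair of mixing values:

```python
# Numeric check: the two noise variances, (1 - beta2) - (1 - beta2)(1 - beta1)
# and beta2, sum to the single combined variance 1 - (1 - beta2)(1 - beta1).
beta1, beta2 = 0.1, 0.2
var_eps1 = (1 - beta2) - (1 - beta2) * (1 - beta1)   # squared coefficient on eps_1
var_eps2 = beta2                                     # squared coefficient on eps_2
combined = 1 - (1 - beta2) * (1 - beta1)
assert abs((var_eps1 + var_eps2) - combined) < 1e-12
```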

Diffusion Models

More broadly, let

\[ \tilde{\beta}_t = \prod \limits_{s = 1}^t (1 - \beta_s) \]

Then, we can define:

\[ P(\mathbf z_t | \mathbf z_0) = \mathcal N \left(\mathbf z_t \mid \sqrt{\tilde{\beta}_t} \mathbf z_0 , (1 - \tilde{\beta}_t ) \mathcal I \right) \]

  • We know the conditional distribution of the latent variable at any \(t\) given the input!

  • Called the diffusion kernel

  • Note that the mean necessarily goes to zero and the variance approaches identity!
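The diffusion kernel lets us sample \(\mathbf z_t\) from \(\mathbf z_0\) in one shot instead of stepping through the chain. A minimal numpy sketch (illustrative names):

```python
# One-shot sampling from the diffusion kernel P(z_t | z_0), a sketch.
import numpy as np

def diffusion_kernel_sample(z0, betas, t, rng):
    """Draw z_t ~ N(sqrt(tilde_beta_t) z_0, (1 - tilde_beta_t) I) directly."""
    tilde_beta_t = np.prod(1.0 - betas[:t])   # prod_{s<=t} (1 - beta_s)
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(tilde_beta_t) * z0 + np.sqrt(1.0 - tilde_beta_t) * eps

rng = np.random.default_rng(0)
betas = np.full(1000, 0.02)
z0 = rng.standard_normal(16)
zT = diffusion_kernel_sample(z0, betas, t=1000, rng=rng)
# With T = 1000 and beta = 0.02, tilde_beta_T is ~0: z_T is essentially pure noise.
```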

Diffusion Models

Key note: we’ve “noised” the image, but we can track back to the original given the structure of the noise!

  • Up to probabilistic uncertainty, I know what \(\mathbf z_t\) is given \(\mathbf x\)

This means that I could, theoretically, learn a mapping back from noise to the original image!!!

  • The key idea that makes this diffusion model tick

Diffusion Models

At each time point, \(\mathbf z_t\) represents a more and more noised version of the input image

  • All inputs converge to the same stationary noise distribution

  • The path is what tells us how to go from the stable point back to an image!

Regardless of where we start, we end up at \(\mathbf z_T \sim \mathcal N(\mathbf 0 , \mathcal I)\)!

Defines a full diffusion generator:

  • Each image is a draw from \(\mathbf z_T\)

  • Map through the backwards process

  • Get an image in the pixel space
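The generator loop described above can be sketched as follows. Here `g` is a hypothetical placeholder for the learned reverse-mean network, not a trained model:

```python
# Sketch of the backwards (generation) process: draw z_T ~ N(0, I), then step
# back through the learned reverse conditionals to reach pixel space.
import numpy as np

def generate(g, betas, sigmas, shape, rng):
    T = len(betas)
    z = rng.standard_normal(shape)              # z_T ~ N(0, I)
    for t in range(T, 0, -1):                   # map backwards: T, T-1, ..., 1
        mean = g(z, t)                          # approximate reverse mean
        noise = rng.standard_normal(shape) if t > 1 else 0.0
        z = mean + sigmas[t - 1] * noise        # no noise on the final step
    return z                                    # a sample in pixel space

# Toy placeholder "network" that just shrinks its input (NOT a trained model).
g = lambda z, t: 0.9 * z
rng = np.random.default_rng(0)
sample = generate(g, betas=np.full(10, 0.02), sigmas=np.full(10, 0.1),
                  shape=(4, 4), rng=rng)
```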

Diffusion Models

The encoder defines a sequence of distributions:

\[ P(\mathbf z_t | \mathbf z_0) \text{ or } P(\mathbf z_t | \mathbf z_{t-1}) \]

Thus, the decoder should reverse these conditionals:

\[ P(\mathbf z_0 | \mathbf z_t) \text{ or } P(\mathbf z_{t-1} | \mathbf z_t) \]

  • Invert the encoder!

Diffusion Models

It may seem like this is going to be easy, but think back to QTM 110 and Bayes’ theorem:

\[ P(\mathbf z_{t-1} | \mathbf z_{t}) = \frac{P(\mathbf z_{t} | \mathbf z_{t-1})P(\mathbf z_{t-1})}{P(\mathbf z_{t})} \]

Unfortunately, we don’t know the two marginals!

\[ P(\mathbf z_t) = \int \int ... \int P(\mathbf z_t | \mathbf z_{t-1})P(\mathbf z_{t-1} | \mathbf z_{t-2})...P(\mathbf z_{1} | \mathbf z_{0}) P(\mathbf z_0) \, d \mathbf z_0 \, d \mathbf z_1 ... \, d \mathbf z_{t-1} \]

  • Yeesh. No tricks here.

Diffusion Models

Important point:

Even though the forward direction conditional will be normal, by assumption, the reverse probably won’t be!

  • Because we assume normality in one direction, we must account for the data distribution in the other!

  • Can simulate for 1 dimensional model.
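That 1-dimensional simulation can be sketched directly: assume the data is concentrated at \(-1\) and \(+1\), apply one forward step, and use Bayes' theorem. The reverse conditional keeps two modes, so it cannot be normal:

```python
# 1-D illustration: the forward conditional is normal, but the reverse
# P(z_0 | z_1) is a two-point posterior whose weights depend on the data
# distribution -- bimodal, not normal. (Toy data assumed at -1 and +1.)
import numpy as np

beta = 0.3
z1 = 0.0   # observe a noised value exactly between the two modes

def fwd_density(z1, z0, beta):
    """Density of the forward step N(z1 | sqrt(1 - beta) z0, beta)."""
    mean = np.sqrt(1 - beta) * z0
    return np.exp(-(z1 - mean) ** 2 / (2 * beta)) / np.sqrt(2 * np.pi * beta)

w_pos = fwd_density(z1, +1.0, beta)
w_neg = fwd_density(z1, -1.0, beta)
post_pos = w_pos / (w_pos + w_neg)   # P(z_0 = +1 | z_1) via Bayes' theorem
# At z1 = 0 the posterior puts mass 1/2 on each mode: two spikes, not a Gaussian.
```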

Diffusion Models

However, there is one reverse conditional that we can know:

\[ P(\mathbf z_{t - 1} | \mathbf z_t , \mathbf z_0) \]

  • The reverse conditional (e.g. backwards process) conditional on the original input!

  • Not useful when there is no original input

  • But, will be useful for training

Diffusion Models

\[ P(\mathbf z_{t-1} | \mathbf z_{t} , \mathbf z_0) = \frac{P(\mathbf z_t | \mathbf z_{t-1} , \mathbf z_0)P(\mathbf z_{t-1} | \mathbf z_0)}{P(\mathbf z_t | \mathbf z_0)} \propto P(\mathbf z_t | \mathbf z_{t-1} , \mathbf z_0)P(\mathbf z_{t-1} | \mathbf z_0) \]

  • Anyone see the trick here? Think about Markov processes

Diffusion Models

\[ P(\mathbf z_{t-1} | \mathbf z_{t} , \mathbf z_0) \propto P(\mathbf z_t | \mathbf z_{t-1})P(\mathbf z_{t-1} | \mathbf z_0) \]

We know both of these by definition of the forward process:

\[ \mathcal N \left( \mathbf z_t \mid \sqrt{1 - \beta_t}\mathbf z_{t-1} , \beta_t \mathcal I \right) \mathcal N\left(\mathbf z_{t-1} \mid \sqrt{\tilde{\beta}_{t-1}} \mathbf z_0 , (1 - \tilde{\beta}_{t-1}) \mathcal I \right) \]

Diffusion Models

Through some normal Bayes magic, we can show that this distribution can be cast in terms of \(\mathbf z_{t-1}\):

\[ P(\mathbf z_{t-1} | \mathbf z_t , \mathbf z_0) = \mathcal N\left(\mathbf z_{t-1} \mid \frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_{t}} \sqrt{1 - \beta_t} \mathbf z_t + \frac{\sqrt{\tilde{\beta}_{t-1}}\beta_t}{1 - \tilde{\beta}_t} \mathbf z_0 , \frac{\beta_t (1 - \tilde{\beta}_{t-1})}{1 - \tilde{\beta}_t} \mathcal I \right) \]

  • In words, the reverse conditional conditioned on the original input is also normal!

  • Given an input value, \(\mathbf z_t\), and a mixing rate, we can assess the likelihood of the previous latent variable!

  • Note that all of the terms with \(\beta\) are pre-determined and set!

  • We can “denoise” an image!
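The posterior above translates directly into code; a numpy sketch with illustrative names, using the \(\tilde{\beta}\) products defined earlier:

```python
# The closed-form reverse conditional P(z_{t-1} | z_t, z_0) as a function.
import numpy as np

def reverse_posterior(z_t, z0, betas, t):
    """Mean and variance of P(z_{t-1} | z_t, z_0) for a 1-indexed step t >= 2."""
    beta_t = betas[t - 1]
    tb_t = np.prod(1.0 - betas[:t])          # tilde_beta_t
    tb_prev = np.prod(1.0 - betas[:t - 1])   # tilde_beta_{t-1}
    mean = ((1 - tb_prev) / (1 - tb_t)) * np.sqrt(1 - beta_t) * z_t \
         + (np.sqrt(tb_prev) * beta_t / (1 - tb_t)) * z0
    var = beta_t * (1 - tb_prev) / (1 - tb_t)
    return mean, var

# If z_t and z_0 agree, the posterior mean stays close to that shared value,
# and the variance is strictly smaller than beta_t.
mean, var = reverse_posterior(1.0, 1.0, np.full(100, 0.02), t=50)
```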

Diffusion Models

This diffusion model defines a diffusion mapping:

\[ P(\mathbf z_T) = \mathcal N(\mathbf 0 , \mathcal I) \]

\[ P(\mathbf z_{t-1} | \mathbf z_t, \mathbf z_0) = \mathcal N\left(\mathbf z_{t-1} \mid \frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_{t}} \sqrt{1 - \beta_t} \mathbf z_t + \frac{\sqrt{\tilde{\beta}_{t-1}}\beta_t}{1 - \tilde{\beta}_t} \mathbf z_0 , \frac{\beta_t (1 - \tilde{\beta}_{t-1})}{1 - \tilde{\beta}_t} \mathcal I \right) \]

Big problem!

  • How can we use this to generate images? No \(\mathbf z_0\), no go.

Diffusion Models

Instead:

\[ Q(\mathbf z_{t-1} | \mathbf z_t, \boldsymbol \theta_t) = \mathcal N \left(\mathbf z_{t-1} \mid g(\mathbf z_{t}, \boldsymbol \theta_t) , \sigma^2_t \mathcal I \right) \]

  • Approximate the true reverse mapping without relying on \(\mathbf z_0\)

  • Come up with an approximate distribution that is as close as possible to \(P(\mathbf z_{t-1} | \mathbf z_t, \mathbf z_0)\) without explicitly leveraging info about \(\mathbf z_0\) until the end of the diffusion process

  • Each time step has its own set of parameters that maps \(\mathbf z_t\) to \(\mathbf z_{t-1}\)

Diffusion Models

What we would like to optimize:

\[ \hat{\boldsymbol \theta}_{1,2,...,T} = \underset{\boldsymbol \theta}{\text{argmax }} \left[\sum \limits_{i = 1}^N \log P(\mathbf x_i | \boldsymbol \theta_{1,2,...,T})\right] \]

where:

\[ P(\mathbf x | \boldsymbol \theta_{1,2,...,T}) = \int P(\mathbf x , \mathbf z_1, ... , \mathbf z_T | \boldsymbol \theta_{1,2,...,T}) d \mathbf z_1 d \mathbf z_2... d \mathbf z_T \]

This is intractable.

Any ideas?

Diffusion Models

As with VAEs:

Let \(Q(\mathbf z_1, \mathbf z_2, ... , \mathbf z_T | \mathbf x)\) be an approximation to the intractable reverse conditional, \(P(\mathbf z_1,\mathbf z_2, ..., \mathbf z_T | \mathbf x)\).

Maximize the evidence lower bound:

\[ \int Q\left (\mathbf z_{1...T} | \mathbf x \right) \log \left[\frac{P(\mathbf x , \mathbf z_{1...T} | \boldsymbol \theta_{1...T})}{Q(\mathbf z_{1...T} | \mathbf x)}\right] d \mathbf z_{1...T} \]

This ELBO is a little different than the VAE one since we need to deal with a sequence of latent variables instead of a single vector!

Diffusion Models

Withholding the math (mostly relying on the fact that the latent variables form a Markov chain), we can show that the ELBO has the form:

\[ E_Q \left[\log P(\mathbf x | \mathbf z_1 , \boldsymbol \theta_1) \right] - \sum \limits_{t = 2}^T E_Q \left[D_{KL} \left[P(\mathbf z_{t-1} | \mathbf z_t , \mathbf x) || Q(\mathbf z_{t-1} | \mathbf z_t , \boldsymbol \theta_t) \right] \right] \]

  • The first term is the likelihood of the ground truth images at the end of the latent variable chain - normal, by assumption

  • The second term is the KL divergence between two normals:

\[ P(\mathbf z_{t-1} | \mathbf z_t , \mathbf z_0) = \mathcal N\left(\frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_{t}} \sqrt{1 - \beta_t} \mathbf z_t + \frac{\sqrt{\tilde{\beta}_{t-1}}\beta_t}{1 - \tilde{\beta}_t} \mathbf z_0 , \frac{\beta_t (1 - \tilde{\beta}_{t-1})}{1 - \tilde{\beta}_t} \mathcal I \right) \]

\[ Q(\mathbf z_{t-1} | \mathbf z_t, \boldsymbol \theta_t) = \mathcal N \left(\mathbf z_{t-1} \mid g(\mathbf z_{t}, \boldsymbol \theta_t) , \sigma^2_t \mathcal I \right) \]

Diffusion Models

Since the KL divergence is between two normals, we can define an analytical loss function for training a diffusion model:

\[ \begin{align} \sum \limits_{i = 1}^N \Bigg[ & - \log \mathcal N(\mathbf x_i \mid g(\mathbf z_{i1} , \boldsymbol \theta_1) , \sigma^2_1 \mathcal I) \; + \\ & \sum \limits_{t = 2}^T \frac{1}{2\sigma^2_t} \left\| \frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_t} \sqrt{1 - \beta_t} \mathbf z_{it} + \frac{\sqrt{\tilde{\beta}_{t-1}} \beta_t}{1 - \tilde{\beta}_t} \mathbf x_i - g(\mathbf z_{it}, \boldsymbol \theta_t) \right\|^2 \Bigg] \end{align} \]

  • The first term is the reconstruction term

  • The second is the distance between the target mean of the conditional and the approximated mean!

This is trainable!
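The per-step squared-error term of this loss is easy to write down. In this sketch `g` is a hypothetical stand-in for the trainable mean network, not an actual architecture:

```python
# One KL term of the training loss above, for a single example at step t.
# `g` is a placeholder for the trainable mean network (an assumption).
import numpy as np

def kl_term(z_t, x, g, betas, sigma_t, t):
    beta_t = betas[t - 1]
    tb_t = np.prod(1.0 - betas[:t])          # tilde_beta_t
    tb_prev = np.prod(1.0 - betas[:t - 1])   # tilde_beta_{t-1}
    target = ((1 - tb_prev) / (1 - tb_t)) * np.sqrt(1 - beta_t) * z_t \
           + (np.sqrt(tb_prev) * beta_t / (1 - tb_t)) * x
    return np.sum((target - g(z_t, t)) ** 2) / (2 * sigma_t ** 2)

# With a zero predictor the term is positive; a predictor that matched the
# target mean exactly would drive it to zero.
loss = kl_term(np.ones(4), np.ones(4), lambda z, t: np.zeros(4),
               betas=np.full(10, 0.02), sigma_t=0.1, t=5)
```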

Diffusion Models

Practical notes:

Diffusion models are often reparameterized in terms of the noise: instead of predicting the posterior mean directly, the network predicts the noise \(\epsilon\) that was added, and the mean is recovered from that prediction. This makes the model a little easier to train, but loses the theoretical simplicity of the original construction.

  • A small change that makes everything a little easier

\[ Q(\mathbf z_{t-1} | \mathbf z_t, \boldsymbol \theta_t) = \mathcal N \left(\mathbf z_{t-1} \mid g(\mathbf z_{t}, \boldsymbol \theta_t) , \sigma^2_t \mathcal I \right) \]

  • Each diffusion step is associated with its own set of parameters!

  • This means we would need to train a different neural network for each diffusion step

  • In general, this is not practical

Diffusion Models

For images, one approach is to create a deep nonlinear mapping between the diffused image at time \(t-1\) and time \(t\)

  • Learn a set of parameters that will take in an image and return a diffused image of the same size

  • Image to image prediction

Anyone got a proposal for a method that does that?

  • Takes in an image and returns a different image?

Diffusion Models


So, at each transition we need a full UNet.

  • That doesn’t seem too plausible.

Instead, a clever trick:

  • Share a single UNet across all diffusion steps

  • But, encode time using positional embeddings

  • Where have we seen time and sequences arise previously?

Diffusion Models


A single UNet:

  • Each block represents a time step in the diffusion process

  • Treated like a sequential step in a transformer model with self-attention (the encoder side)

  • Self-attention steps that look across all time points in the diffusion process to figure out how different diffusion steps relate to one another

  • Time embedding helps to create locational dependencies among the time steps
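The time embedding is the same sinusoidal construction used for positions in transformers; a minimal sketch:

```python
# Sinusoidal time embedding for a diffusion step: the standard transformer
# positional-encoding construction, applied to the step index t.
import numpy as np

def time_embedding(t, dim):
    """Map an integer diffusion step t to a dim-dimensional embedding."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequencies
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = time_embedding(t=37, dim=64)
```

Each step gets a distinct, smoothly varying vector, so one shared network can condition its behavior on where it is in the diffusion chain.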

Diffusion Models

A little more scalable!

But, will still require a lot of time steps to effectively generate realistic looking images.

  • In general, diffusion models work better when \(T\) is large and \(\beta\) is really close to zero

  • Think of it as creating a lot of very small noise steps between the original image and the stable point

  • More small steps mean more places for differences between images
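A common illustrative choice is a linear schedule of small \(\beta_t\) values; the cumulative product \(\tilde{\beta}_t\) then decays toward zero, confirming that the chain ends at (nearly) pure noise:

```python
# A linear beta schedule (one illustrative choice, not the only one): many
# small steps so tilde_beta_t, the surviving signal fraction, decays to ~0.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # beta_t stays close to zero at every step
tilde_beta = np.cumprod(1.0 - betas)   # tilde_beta_t = prod_{s<=t} (1 - beta_s)
# tilde_beta starts near 1 (image nearly intact) and ends near 0 (pure noise).
```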

Diffusion Models

Diffusion is a lot slower than other generative models

  • But, the self-attention UNet has many places for GPU parallelization
  • Kinda the whole point of using self-attention in place of explicit stepping

Any time you see a transformer, you should immediately think that it requires a lot of data and a lot of computational resources!

We normal humans aren’t going to be able to really train good diffusion models

  • But, pre-trained models will help us out a lot here!

Stable Diffusion


Stable diffusion is a recent diffusion generative model

  • Turn text prompts into high quality, high resolution images

A collection of a number of different advances in generative and discriminative modeling!

Inputs:

A set of \(N\) images and \(N\) associated text strings/prompts

Stable Diffusion


Multiple parts. All things we’ve seen before!

Step 1: VAE compression

  • Pass the original image through a VAE encoder to transform the high res image into a smaller, noisier space

  • The latent variable for the VAE is structured to be another image - instead of a vector, the latent space is a 32x32 version of the original image

  • A little noisier and smaller than the original input

  • We know that VAEs work, but create blurry images. Learn what we can from a fast VAE and then let the diffusion model learn the rest!

Stable Diffusion


Step 2: Forward Diffusion

  • Create a sequence of diffusion steps

  • Add a little noise at each step

  • For training, hundreds of steps with very small noise variance!

  • This part is the least computationally intense.

Stable Diffusion

Step 3: Compute prompt embeddings

  • Given different prompts, compute an embedding for each prompt.

  • Most frequently, use a pre-trained BERT model with 512 or 1024 dimensional embeddings

Stable Diffusion

Step 4: Denoising UNet

  • UNet with positional embeddings to translate fully diffused image back to pixel space

  • Cross-attention with prompt embedding at each step

  • Self-attention with all diffusion time steps at each step

The money maker:

  • With enough GPUs, forego the time embedding shortcut and train a separate UNet for each time transition!

  • Better quality images since there are way more parameters.

  • We mere mortals can’t replicate this part…


Stable Diffusion

Step 5: VAE decoder

Take the output image from the denoising UNet and decode it using the trained VAE.

  • Go from low-res back to high-res!

Stable Diffusion

This entire process is backpropable

  • Can be trained as one large deep network!

Leads to an absolutely insane level of image quality that largely matches the prompt!

Stable Diffusion

We probably can’t train our own stable diffusion models.

But, we can use pre-existing ones!

Stable Diffusion

Stable diffusion is the new hotness

  • Like GPT, it is only trainable from scratch by those with the most GPU resources

  • But, generated image quality is higher than that of GANs

  • Text to Image conditioning is easier in Stable Diffusion than with Conditional GANs, so more control over what gets sampled.

Stable Diffusion

Transformer style diffusion models seem to work really well

  • More research is needed to understand why this seems to work so much better than self-attention GANs

  • GANs and VAEs are way more computationally efficient

  • This can be your research area…